The analysis is based on the LifeExpectancy.csv dataset, which contains repeated country-level observations from 2000 to 2015 for 179 countries. The dataset includes demographic, health, and socioeconomic indicators such as infant mortality, BMI, HIV incidence, GDP per capita, schooling, child thinness, measles incidence, region, and economic development status, with life expectancy as the primary continuous outcome. For different modeling tasks, the data were subset by year, transformed to create a binary infant-mortality outcome (infant deaths ≥30), and split into training, validation, and testing sets. These data were then used to evaluate Ridge and LASSO models for classification and generalized additive models with splines for nonlinear prediction of life expectancy.
knitr::opts_chunk$set(echo = TRUE)
library(splines)
library(caret) #confusionMatrix
library(ISLR2)
library(gam)
library(tidyverse)
library(glmnet)
library(boot)
library(vcd) # for mosaic plots
library(leaps)
RNGkind(kind = NULL, normal.kind = NULL, sample.kind = NULL)
## [1] "Mersenne-Twister" "Inversion" "Rejection"
set.seed(2345, sample.kind="Rounding")
## Warning in set.seed(2345, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
sample(1:10,5)
## [1] 2 10 6 1 3
rm(list=ls())
df <- read.csv("LifeExpectancy.csv")
str(df)
## 'data.frame': 2864 obs. of 13 variables:
## $ Country : chr "Turkiye" "Spain" "India" "Guyana" ...
## $ Region : chr "Middle East" "European Union" "Asia" "South America" ...
## $ Year : int 2015 2015 2007 2006 2012 2006 2015 2000 2001 2008 ...
## $ Infant_deaths : num 11.1 2.7 51.5 32.8 3.4 9.8 6.6 8.7 22 15.3 ...
## $ Measles : int 65 94 35 74 89 86 97 99 87 92 ...
## $ BMI : num 27.8 26 21.2 25.3 27 26.4 26.2 25.9 27.9 26.5 ...
## $ Incidents_HIV : num 0.08 0.09 0.13 0.79 0.08 0.16 0.08 0.08 0.13 0.43 ...
## $ GDP_per_capita : int 11006 25742 1076 4146 33995 9110 9313 8971 3708 2235 ...
## $ Thinness_five_nine_years : num 4.8 0.5 28 5.5 1.1 1.9 2.3 2.3 3.9 3.1 ...
## $ Schooling : num 7.8 9.7 5 7.9 12.8 7.9 12 10.2 9.6 10.9 ...
## $ Economy_status_Developed : int 0 1 0 0 1 0 0 1 0 0 ...
## $ Economy_status_Developing: int 1 0 1 1 0 1 1 0 1 1 ...
## $ Life_expectancy : num 76.5 82.8 65.4 67 81.7 78.2 71.2 71.2 71.9 68.7 ...
head(df)
## Country Region Year Infant_deaths Measles BMI
## 1 Turkiye Middle East 2015 11.1 65 27.8
## 2 Spain European Union 2015 2.7 94 26.0
## 3 India Asia 2007 51.5 35 21.2
## 4 Guyana South America 2006 32.8 74 25.3
## 5 Israel Middle East 2012 3.4 89 27.0
## 6 Costa Rica Central America and Caribbean 2006 9.8 86 26.4
## Incidents_HIV GDP_per_capita Thinness_five_nine_years Schooling
## 1 0.08 11006 4.8 7.8
## 2 0.09 25742 0.5 9.7
## 3 0.13 1076 28.0 5.0
## 4 0.79 4146 5.5 7.9
## 5 0.08 33995 1.1 12.8
## 6 0.16 9110 1.9 7.9
## Economy_status_Developed Economy_status_Developing Life_expectancy
## 1 0 1 76.5
## 2 1 0 82.8
## 3 0 1 65.4
## 4 0 1 67.0
## 5 1 0 81.7
## 6 0 1 78.2
dim(df)
## [1] 2864 13
# Check for missing values
sum(is.na(df))
## [1] 0
# Check the ranges of continuous variables
ranges <- sapply(df[, sapply(df, is.numeric)],
function(x) c(min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE)))
t(ranges)
## min max
## Year 2000.00 2015.00
## Infant_deaths 1.80 138.10
## Measles 10.00 99.00
## BMI 19.80 32.10
## Incidents_HIV 0.01 21.68
## GDP_per_capita 148.00 112418.00
## Thinness_five_nine_years 0.10 28.60
## Schooling 1.10 14.10
## Economy_status_Developed 0.00 1.00
## Economy_status_Developing 0.00 1.00
## Life_expectancy 39.40 83.80
Comment: Ther are no missing values in the dataset. Based on the observed ranges, all variables appear to fall within realistic and commonly reported values. For example, Year spans 2000–2015, country-level BMI averages fall between 19.8 and 32.1, infant mortality ranges from 1.8 to 138.1 per 1,000 live births, GDP per capita ranges from 148 to 112,418 USD, and life expectancy spans 39.4–83.8 years. These values are all plausible for global country-level data, so none of the variables show abnormal or implausible ranges.
# Convert Country, Region, Economy_status_Developed, and Economy_status_Developing as factor variables
df <- df %>% mutate(
Country = as.factor(Country),
Region = as.factor(Region),
Economy_status_Developed = as.factor(Economy_status_Developed),
Economy_status_Developing = as.factor(Economy_status_Developing)
)
# Check the factor variables
table(df$Country)
##
## Afghanistan Albania
## 16 16
## Algeria Angola
## 16 16
## Antigua and Barbuda Argentina
## 16 16
## Armenia Australia
## 16 16
## Austria Azerbaijan
## 16 16
## Bahamas, The Bahrain
## 16 16
## Bangladesh Barbados
## 16 16
## Belarus Belgium
## 16 16
## Belize Benin
## 16 16
## Bhutan Bolivia
## 16 16
## Bosnia and Herzegovina Botswana
## 16 16
## Brazil Brunei Darussalam
## 16 16
## Bulgaria Burkina Faso
## 16 16
## Burundi Cabo Verde
## 16 16
## Cambodia Cameroon
## 16 16
## Canada Central African Republic
## 16 16
## Chad Chile
## 16 16
## China Colombia
## 16 16
## Comoros Congo, Dem. Rep.
## 16 16
## Congo, Rep. Costa Rica
## 16 16
## Cote d'Ivoire Croatia
## 16 16
## Cuba Cyprus
## 16 16
## Czechia Denmark
## 16 16
## Djibouti Dominican Republic
## 16 16
## Ecuador Egypt, Arab Rep.
## 16 16
## El Salvador Equatorial Guinea
## 16 16
## Eritrea Estonia
## 16 16
## Eswatini Ethiopia
## 16 16
## Fiji Finland
## 16 16
## France Gabon
## 16 16
## Gambia, The Georgia
## 16 16
## Germany Ghana
## 16 16
## Greece Grenada
## 16 16
## Guatemala Guinea
## 16 16
## Guinea-Bissau Guyana
## 16 16
## Haiti Honduras
## 16 16
## Hungary Iceland
## 16 16
## India Indonesia
## 16 16
## Iran, Islamic Rep. Iraq
## 16 16
## Ireland Israel
## 16 16
## Italy Jamaica
## 16 16
## Japan Jordan
## 16 16
## Kazakhstan Kenya
## 16 16
## Kiribati Kuwait
## 16 16
## Kyrgyz Republic Lao PDR
## 16 16
## Latvia Lebanon
## 16 16
## Lesotho Liberia
## 16 16
## Libya Lithuania
## 16 16
## Luxembourg Madagascar
## 16 16
## Malawi Malaysia
## 16 16
## Maldives Mali
## 16 16
## Malta Mauritania
## 16 16
## Mauritius Mexico
## 16 16
## Micronesia, Fed. Sts. Moldova
## 16 16
## Mongolia Montenegro
## 16 16
## Morocco Mozambique
## 16 16
## Myanmar Namibia
## 16 16
## Nepal Netherlands
## 16 16
## New Zealand Nicaragua
## 16 16
## Niger Nigeria
## 16 16
## North Macedonia Norway
## 16 16
## Oman Pakistan
## 16 16
## Panama Papua New Guinea
## 16 16
## Paraguay Peru
## 16 16
## Philippines Poland
## 16 16
## Portugal Qatar
## 16 16
## Romania Russian Federation
## 16 16
## Rwanda Samoa
## 16 16
## Sao Tome and Principe Saudi Arabia
## 16 16
## Senegal Serbia
## 16 16
## Seychelles Sierra Leone
## 16 16
## Singapore Slovak Republic
## 16 16
## Slovenia Solomon Islands
## 16 16
## Somalia South Africa
## 16 16
## Spain Sri Lanka
## 16 16
## St. Lucia St. Vincent and the Grenadines
## 16 16
## Suriname Sweden
## 16 16
## Switzerland Syrian Arab Republic
## 16 16
## Tajikistan Tanzania
## 16 16
## Thailand Timor-Leste
## 16 16
## Togo Tonga
## 16 16
## Trinidad and Tobago Tunisia
## 16 16
## Turkiye Turkmenistan
## 16 16
## Uganda Ukraine
## 16 16
## United Arab Emirates United Kingdom
## 16 16
## United States Uruguay
## 16 16
## Uzbekistan Vanuatu
## 16 16
## Venezuela, RB Vietnam
## 16 16
## Yemen, Rep. Zambia
## 16 16
## Zimbabwe
## 16
table(df$Region)
##
## Africa Asia
## 816 432
## Central America and Caribbean European Union
## 304 432
## Middle East North America
## 224 48
## Oceania Rest of Europe
## 176 240
## South America
## 192
table(df$Economy_status_Developed)
##
## 0 1
## 2272 592
table(df$Economy_status_Developing)
##
## 0 1
## 592 2272
Comment: I’ve converted Country, Region, Economy_status_Developed, and Economy_status_Developing as factor variables. They were converted well based on the output.
summary(df)
## Country Region Year
## Afghanistan : 16 Africa :816 Min. :2000
## Albania : 16 Asia :432 1st Qu.:2004
## Algeria : 16 European Union :432 Median :2008
## Angola : 16 Central America and Caribbean:304 Mean :2008
## Antigua and Barbuda: 16 Rest of Europe :240 3rd Qu.:2011
## Argentina : 16 Middle East :224 Max. :2015
## (Other) :2768 (Other) :416
## Infant_deaths Measles BMI Incidents_HIV
## Min. : 1.80 Min. :10.00 Min. :19.80 Min. : 0.0100
## 1st Qu.: 8.10 1st Qu.:64.00 1st Qu.:23.20 1st Qu.: 0.0800
## Median : 19.60 Median :83.00 Median :25.50 Median : 0.1500
## Mean : 30.36 Mean :77.34 Mean :25.03 Mean : 0.8943
## 3rd Qu.: 47.35 3rd Qu.:93.00 3rd Qu.:26.40 3rd Qu.: 0.4600
## Max. :138.10 Max. :99.00 Max. :32.10 Max. :21.6800
##
## GDP_per_capita Thinness_five_nine_years Schooling
## Min. : 148 Min. : 0.1 Min. : 1.100
## 1st Qu.: 1416 1st Qu.: 1.6 1st Qu.: 5.100
## Median : 4217 Median : 3.4 Median : 7.800
## Mean : 11541 Mean : 4.9 Mean : 7.632
## 3rd Qu.: 12557 3rd Qu.: 7.3 3rd Qu.:10.300
## Max. :112418 Max. :28.6 Max. :14.100
##
## Economy_status_Developed Economy_status_Developing Life_expectancy
## 0:2272 0: 592 Min. :39.40
## 1: 592 1:2272 1st Qu.:62.70
## Median :71.40
## Mean :68.86
## 3rd Qu.:75.40
## Max. :83.80
##
# histograms for numeric columns to visually assess distributions
df %>%
select_if(is.numeric) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(title = "Distributions of Continuous Variables")
Comment: I checked the continuous variables using summary statistics and visualized their distributions with histograms. No variables appeared incorrectly coded as numeric indicators, and the observed ranges were consistent with expectations. Several variables showed strong right-skewness, but no irregular values suggesting data coding errors were identified.
summary(df)
## Country Region Year
## Afghanistan : 16 Africa :816 Min. :2000
## Albania : 16 Asia :432 1st Qu.:2004
## Algeria : 16 European Union :432 Median :2008
## Angola : 16 Central America and Caribbean:304 Mean :2008
## Antigua and Barbuda: 16 Rest of Europe :240 3rd Qu.:2011
## Argentina : 16 Middle East :224 Max. :2015
## (Other) :2768 (Other) :416
## Infant_deaths Measles BMI Incidents_HIV
## Min. : 1.80 Min. :10.00 Min. :19.80 Min. : 0.0100
## 1st Qu.: 8.10 1st Qu.:64.00 1st Qu.:23.20 1st Qu.: 0.0800
## Median : 19.60 Median :83.00 Median :25.50 Median : 0.1500
## Mean : 30.36 Mean :77.34 Mean :25.03 Mean : 0.8943
## 3rd Qu.: 47.35 3rd Qu.:93.00 3rd Qu.:26.40 3rd Qu.: 0.4600
## Max. :138.10 Max. :99.00 Max. :32.10 Max. :21.6800
##
## GDP_per_capita Thinness_five_nine_years Schooling
## Min. : 148 Min. : 0.1 Min. : 1.100
## 1st Qu.: 1416 1st Qu.: 1.6 1st Qu.: 5.100
## Median : 4217 Median : 3.4 Median : 7.800
## Mean : 11541 Mean : 4.9 Mean : 7.632
## 3rd Qu.: 12557 3rd Qu.: 7.3 3rd Qu.:10.300
## Max. :112418 Max. :28.6 Max. :14.100
##
## Economy_status_Developed Economy_status_Developing Life_expectancy
## 0:2272 0: 592 Min. :39.40
## 1: 592 1:2272 1st Qu.:62.70
## Median :71.40
## Mean :68.86
## 3rd Qu.:75.40
## Max. :83.80
##
# Histogram of numeric variables
hist(df$Year)
hist(df$Infant_deaths)
hist(df$Measles)
hist(df$BMI)
hist(df$Incidents_HIV)
hist(df$GDP_per_capita)
hist(df$Thinness_five_nine_years)
hist(df$Schooling)
hist(df$Life_expectancy)
# List of numeric variables
num_vars <- c("Infant_deaths", "Measles", "BMI", "Incidents_HIV", "GDP_per_capita", "Thinness_five_nine_years", "Schooling", "Life_expectancy")
# Correlation among numeric variables
cor_matrix <- cor(df[num_vars], use = "pairwise.complete.obs")
round(cor_matrix[1:6, 1:6], 2)
## Infant_deaths Measles BMI Incidents_HIV
## Infant_deaths 1.00 -0.53 -0.66 0.35
## Measles -0.53 1.00 0.42 -0.15
## BMI -0.66 0.42 1.00 -0.16
## Incidents_HIV 0.35 -0.15 -0.16 1.00
## GDP_per_capita -0.51 0.31 0.34 -0.17
## Thinness_five_nine_years 0.48 -0.37 -0.60 0.19
## GDP_per_capita Thinness_five_nine_years
## Infant_deaths -0.51 0.48
## Measles 0.31 -0.37
## BMI 0.34 -0.60
## Incidents_HIV -0.17 0.19
## GDP_per_capita 1.00 -0.38
## Thinness_five_nine_years -0.38 1.00
# Graphical summaries
par(mfrow = c(2, 3))
for (v in num_vars) {
hist(df[[v]], main = v, col = "lightblue", border = "white")
}
# Pairwise scatterplot matrix
pairs(df[, num_vars[1:5]], main = "Pairwise scatterplot matrix")
# Summary of categorical variables
summary(df[, c("Country", "Region", "Economy_status_Developed", "Economy_status_Developing")])
## Country Region
## Afghanistan : 16 Africa :816
## Albania : 16 Asia :432
## Algeria : 16 European Union :432
## Angola : 16 Central America and Caribbean:304
## Antigua and Barbuda: 16 Rest of Europe :240
## Argentina : 16 Middle East :224
## (Other) :2768 (Other) :416
## Economy_status_Developed Economy_status_Developing
## 0:2272 0: 592
## 1: 592 1:2272
##
##
##
##
##
cat_vars <- df[, sapply(df, is.character) | sapply(df, is.factor)]
lapply(cat_vars, function(x) {
list(
table = table(x),
proportion = prop.table(table(x))
)
})
## $Country
## $Country$table
## x
## Afghanistan Albania
## 16 16
## Algeria Angola
## 16 16
## Antigua and Barbuda Argentina
## 16 16
## Armenia Australia
## 16 16
## Austria Azerbaijan
## 16 16
## Bahamas, The Bahrain
## 16 16
## Bangladesh Barbados
## 16 16
## Belarus Belgium
## 16 16
## Belize Benin
## 16 16
## Bhutan Bolivia
## 16 16
## Bosnia and Herzegovina Botswana
## 16 16
## Brazil Brunei Darussalam
## 16 16
## Bulgaria Burkina Faso
## 16 16
## Burundi Cabo Verde
## 16 16
## Cambodia Cameroon
## 16 16
## Canada Central African Republic
## 16 16
## Chad Chile
## 16 16
## China Colombia
## 16 16
## Comoros Congo, Dem. Rep.
## 16 16
## Congo, Rep. Costa Rica
## 16 16
## Cote d'Ivoire Croatia
## 16 16
## Cuba Cyprus
## 16 16
## Czechia Denmark
## 16 16
## Djibouti Dominican Republic
## 16 16
## Ecuador Egypt, Arab Rep.
## 16 16
## El Salvador Equatorial Guinea
## 16 16
## Eritrea Estonia
## 16 16
## Eswatini Ethiopia
## 16 16
## Fiji Finland
## 16 16
## France Gabon
## 16 16
## Gambia, The Georgia
## 16 16
## Germany Ghana
## 16 16
## Greece Grenada
## 16 16
## Guatemala Guinea
## 16 16
## Guinea-Bissau Guyana
## 16 16
## Haiti Honduras
## 16 16
## Hungary Iceland
## 16 16
## India Indonesia
## 16 16
## Iran, Islamic Rep. Iraq
## 16 16
## Ireland Israel
## 16 16
## Italy Jamaica
## 16 16
## Japan Jordan
## 16 16
## Kazakhstan Kenya
## 16 16
## Kiribati Kuwait
## 16 16
## Kyrgyz Republic Lao PDR
## 16 16
## Latvia Lebanon
## 16 16
## Lesotho Liberia
## 16 16
## Libya Lithuania
## 16 16
## Luxembourg Madagascar
## 16 16
## Malawi Malaysia
## 16 16
## Maldives Mali
## 16 16
## Malta Mauritania
## 16 16
## Mauritius Mexico
## 16 16
## Micronesia, Fed. Sts. Moldova
## 16 16
## Mongolia Montenegro
## 16 16
## Morocco Mozambique
## 16 16
## Myanmar Namibia
## 16 16
## Nepal Netherlands
## 16 16
## New Zealand Nicaragua
## 16 16
## Niger Nigeria
## 16 16
## North Macedonia Norway
## 16 16
## Oman Pakistan
## 16 16
## Panama Papua New Guinea
## 16 16
## Paraguay Peru
## 16 16
## Philippines Poland
## 16 16
## Portugal Qatar
## 16 16
## Romania Russian Federation
## 16 16
## Rwanda Samoa
## 16 16
## Sao Tome and Principe Saudi Arabia
## 16 16
## Senegal Serbia
## 16 16
## Seychelles Sierra Leone
## 16 16
## Singapore Slovak Republic
## 16 16
## Slovenia Solomon Islands
## 16 16
## Somalia South Africa
## 16 16
## Spain Sri Lanka
## 16 16
## St. Lucia St. Vincent and the Grenadines
## 16 16
## Suriname Sweden
## 16 16
## Switzerland Syrian Arab Republic
## 16 16
## Tajikistan Tanzania
## 16 16
## Thailand Timor-Leste
## 16 16
## Togo Tonga
## 16 16
## Trinidad and Tobago Tunisia
## 16 16
## Turkiye Turkmenistan
## 16 16
## Uganda Ukraine
## 16 16
## United Arab Emirates United Kingdom
## 16 16
## United States Uruguay
## 16 16
## Uzbekistan Vanuatu
## 16 16
## Venezuela, RB Vietnam
## 16 16
## Yemen, Rep. Zambia
## 16 16
## Zimbabwe
## 16
##
## $Country$proportion
## x
## Afghanistan Albania
## 0.005586592 0.005586592
## Algeria Angola
## 0.005586592 0.005586592
## Antigua and Barbuda Argentina
## 0.005586592 0.005586592
## Armenia Australia
## 0.005586592 0.005586592
## Austria Azerbaijan
## 0.005586592 0.005586592
## Bahamas, The Bahrain
## 0.005586592 0.005586592
## Bangladesh Barbados
## 0.005586592 0.005586592
## Belarus Belgium
## 0.005586592 0.005586592
## Belize Benin
## 0.005586592 0.005586592
## Bhutan Bolivia
## 0.005586592 0.005586592
## Bosnia and Herzegovina Botswana
## 0.005586592 0.005586592
## Brazil Brunei Darussalam
## 0.005586592 0.005586592
## Bulgaria Burkina Faso
## 0.005586592 0.005586592
## Burundi Cabo Verde
## 0.005586592 0.005586592
## Cambodia Cameroon
## 0.005586592 0.005586592
## Canada Central African Republic
## 0.005586592 0.005586592
## Chad Chile
## 0.005586592 0.005586592
## China Colombia
## 0.005586592 0.005586592
## Comoros Congo, Dem. Rep.
## 0.005586592 0.005586592
## Congo, Rep. Costa Rica
## 0.005586592 0.005586592
## Cote d'Ivoire Croatia
## 0.005586592 0.005586592
## Cuba Cyprus
## 0.005586592 0.005586592
## Czechia Denmark
## 0.005586592 0.005586592
## Djibouti Dominican Republic
## 0.005586592 0.005586592
## Ecuador Egypt, Arab Rep.
## 0.005586592 0.005586592
## El Salvador Equatorial Guinea
## 0.005586592 0.005586592
## Eritrea Estonia
## 0.005586592 0.005586592
## Eswatini Ethiopia
## 0.005586592 0.005586592
## Fiji Finland
## 0.005586592 0.005586592
## France Gabon
## 0.005586592 0.005586592
## Gambia, The Georgia
## 0.005586592 0.005586592
## Germany Ghana
## 0.005586592 0.005586592
## Greece Grenada
## 0.005586592 0.005586592
## Guatemala Guinea
## 0.005586592 0.005586592
## Guinea-Bissau Guyana
## 0.005586592 0.005586592
## Haiti Honduras
## 0.005586592 0.005586592
## Hungary Iceland
## 0.005586592 0.005586592
## India Indonesia
## 0.005586592 0.005586592
## Iran, Islamic Rep. Iraq
## 0.005586592 0.005586592
## Ireland Israel
## 0.005586592 0.005586592
## Italy Jamaica
## 0.005586592 0.005586592
## Japan Jordan
## 0.005586592 0.005586592
## Kazakhstan Kenya
## 0.005586592 0.005586592
## Kiribati Kuwait
## 0.005586592 0.005586592
## Kyrgyz Republic Lao PDR
## 0.005586592 0.005586592
## Latvia Lebanon
## 0.005586592 0.005586592
## Lesotho Liberia
## 0.005586592 0.005586592
## Libya Lithuania
## 0.005586592 0.005586592
## Luxembourg Madagascar
## 0.005586592 0.005586592
## Malawi Malaysia
## 0.005586592 0.005586592
## Maldives Mali
## 0.005586592 0.005586592
## Malta Mauritania
## 0.005586592 0.005586592
## Mauritius Mexico
## 0.005586592 0.005586592
## Micronesia, Fed. Sts. Moldova
## 0.005586592 0.005586592
## Mongolia Montenegro
## 0.005586592 0.005586592
## Morocco Mozambique
## 0.005586592 0.005586592
## Myanmar Namibia
## 0.005586592 0.005586592
## Nepal Netherlands
## 0.005586592 0.005586592
## New Zealand Nicaragua
## 0.005586592 0.005586592
## Niger Nigeria
## 0.005586592 0.005586592
## North Macedonia Norway
## 0.005586592 0.005586592
## Oman Pakistan
## 0.005586592 0.005586592
## Panama Papua New Guinea
## 0.005586592 0.005586592
## Paraguay Peru
## 0.005586592 0.005586592
## Philippines Poland
## 0.005586592 0.005586592
## Portugal Qatar
## 0.005586592 0.005586592
## Romania Russian Federation
## 0.005586592 0.005586592
## Rwanda Samoa
## 0.005586592 0.005586592
## Sao Tome and Principe Saudi Arabia
## 0.005586592 0.005586592
## Senegal Serbia
## 0.005586592 0.005586592
## Seychelles Sierra Leone
## 0.005586592 0.005586592
## Singapore Slovak Republic
## 0.005586592 0.005586592
## Slovenia Solomon Islands
## 0.005586592 0.005586592
## Somalia South Africa
## 0.005586592 0.005586592
## Spain Sri Lanka
## 0.005586592 0.005586592
## St. Lucia St. Vincent and the Grenadines
## 0.005586592 0.005586592
## Suriname Sweden
## 0.005586592 0.005586592
## Switzerland Syrian Arab Republic
## 0.005586592 0.005586592
## Tajikistan Tanzania
## 0.005586592 0.005586592
## Thailand Timor-Leste
## 0.005586592 0.005586592
## Togo Tonga
## 0.005586592 0.005586592
## Trinidad and Tobago Tunisia
## 0.005586592 0.005586592
## Turkiye Turkmenistan
## 0.005586592 0.005586592
## Uganda Ukraine
## 0.005586592 0.005586592
## United Arab Emirates United Kingdom
## 0.005586592 0.005586592
## United States Uruguay
## 0.005586592 0.005586592
## Uzbekistan Vanuatu
## 0.005586592 0.005586592
## Venezuela, RB Vietnam
## 0.005586592 0.005586592
## Yemen, Rep. Zambia
## 0.005586592 0.005586592
## Zimbabwe
## 0.005586592
##
##
## $Region
## $Region$table
## x
## Africa Asia
## 816 432
## Central America and Caribbean European Union
## 304 432
## Middle East North America
## 224 48
## Oceania Rest of Europe
## 176 240
## South America
## 192
##
## $Region$proportion
## x
## Africa Asia
## 0.28491620 0.15083799
## Central America and Caribbean European Union
## 0.10614525 0.15083799
## Middle East North America
## 0.07821229 0.01675978
## Oceania Rest of Europe
## 0.06145251 0.08379888
## South America
## 0.06703911
##
##
## $Economy_status_Developed
## $Economy_status_Developed$table
## x
## 0 1
## 2272 592
##
## $Economy_status_Developed$proportion
## x
## 0 1
## 0.7932961 0.2067039
##
##
## $Economy_status_Developing
## $Economy_status_Developing$table
## x
## 0 1
## 592 2272
##
## $Economy_status_Developing$proportion
## x
## 0 1
## 0.2067039 0.7932961
ggplot(df, aes(x = Country)) +
geom_bar() +
coord_flip() + # I tried to avoid label overlap, but the number of countries is so large that the plot still appears very crowded.
theme_minimal() +
labs(title = "Distribution of Countries",
x = "Country",
y = "Count")
ggplot(df, aes(x = Region)) +
geom_bar() +
coord_flip() +
theme_minimal()
ggplot(df, aes(x = factor(Economy_status_Developed))) +
geom_bar() +
theme_minimal()
ggplot(df, aes(x = factor(Economy_status_Developing))) +
geom_bar() +
theme_minimal()
Year: Year ranges from 2000 to 2015 with each year represented fairly evenly across countries, producing a uniform distribution of observations over time.
Infant_deaths: Infant_deaths ranges from 1.8 to 138.1, with a median of 19.6 and a mean of 30.4. The strong right-skew reflects that while many countries report relatively low infant mortality, a smaller group experiences very high rates.
Measles: Measles shows a bimodal and right-skewed distribution, with many observations around 60–100 but also a long tail at lower values. Measles cases vary from 10 to 99, with a median of 83 and a mean of 77.3.
BMI: BMI ranges from 19.8 to 32.1, with a median of 25.5 and a mean of 25.0. BMI is approximately symmetric and centered around 24–27, indicating that most countries have population-average BMI in the overweight range. There are few extreme values and no apparent outliers.
Incidents_HIV: Incidents_HIV ranges from 0.01 to 21.68, but has a very low median (0.15) and mean (0.89), showing a highly right-skewed pattern dominated by near-zero values. A small number of countries experience disproportionately high HIV incidence.
GDP_per_capita: GDP_per_capita spans a very wide range from 148 to 112,418, with a median of 4,217 and a mean of 11,541. The severe right skew reflects large global economic disparities, with a few high-income countries driving the mean upward.
Thinness_five_nine_years: Thinness_five_nine_years ranges from 0.1 to 28.6, with a median of 3.4 and mean of 4.9. The distribution is right-skewed, indicating that most countries have low thinness rates (falling between 0 and 10%) while a smaller subset shows much higher prevalence.
Schooling: Schooling ranges from 1.1 to 14.1 years (slightly right-skewed), with a median of 7.8 and mean of 7.63. The values are broadly distributed, reflecting substantial variation in educational attainment across countries.
Life_expectancy: Life_expectancy ranges from 39.4 to 83.8 years, with a median of 71.4 and a mean of 68.9. The distribution is slightly left-skewed, with many countries concentrated in the 65–75 year range.
Country: The Country variable contains 179 distinct countries, and each country contributes repeated observations over time. For example, many countries appear 16 times in the dataset, leading to a wide spread of frequencies across levels.
Region: The Region variable categorizes countries into nine global areas. Africa accounts for the largest share with 816 records, followed by Asia and the European Union with 432 each, and Central America/Caribbean with 304 entries. These counts show that regional representation is uneven across the dataset.
Economy_status_Developed: This binary variable identifies whether an observation is from a developed economy. A total of 592 observations are marked as developed, while the remaining 2,272 are coded as non-developed, indicating that developed economies make up only a small portion of the data.
Economy_status_Developing: This indicator variable designates whether an observation belongs to a developing economy. Because it is defined as the complement of the “Developed” variable, the counts invert accordingly: 2,272 observations are coded as developing and 592 as non-developing. This aligns with the overall structure of the dataset, in which developing countries are more prevalent.
# Create new variable indicating if Infant Deaths >= 30
df <- df %>%
mutate(InfantDeaths_over30 = ifelse(Infant_deaths >= 30, 1, 0))
# Check result
table(df$InfantDeaths_over30)
##
## 0 1
## 1796 1068
# Restrict to only year 2010
df_2010 <- df %>%
filter(Year == 2010)
# Check result
summary(df_2010$Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2010 2010 2010 2010 2010 2010
# Create a subset excluding the specified variables
df_problem1 <- df_2010 %>%
select(-Country, -Year, -Life_expectancy, -Infant_deaths)
# Check structure
str(df_problem1)
## 'data.frame': 179 obs. of 10 variables:
## $ Region : Factor w/ 9 levels "Africa","Asia",..: 8 4 3 4 2 8 1 1 4 2 ...
## $ Measles : int 83 93 91 94 93 97 64 64 96 85 ...
## $ BMI : num 25.1 26.3 28.4 25.8 26.4 26.4 23 23.8 26.1 23.3 ...
## $ Incidents_HIV : num 0.06 0.03 0.37 0.08 0.1 0.04 2.41 5.85 0.02 0.16 ...
## $ GDP_per_capita : int 82027 37761 4768 48370 33437 72427 586 4147 20613 2235 ...
## $ Thinness_five_nine_years : num 0.4 1.1 3.4 1.3 5.4 0.7 7.9 1.4 1.6 17.3 ...
## $ Schooling : num 13.3 13.8 10.5 12.3 8.8 12.7 2.6 6.2 12.1 2.3 ...
## $ Economy_status_Developed : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 1 ...
## $ Economy_status_Developing: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 2 2 1 2 ...
## $ InfantDeaths_over30 : num 0 0 0 0 0 0 1 1 0 1 ...
# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df_problem1$InfantDeaths_over30, p=0.75, list=FALSE)
dftrval <- df_problem1[trainval,]
dftest <- df_problem1[-trainval,]
# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$InfantDeaths_over30, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]
# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 2864 14
## [1] 135 10
## [1] 44 10
## [1] 90 10
## [1] 45 10
# Check outcome variable
table(df_problem1$InfantDeaths_over30)
##
## 0 1
## 117 62
prop.table(table(df_problem1$InfantDeaths_over30))
##
## 0 1
## 0.6536313 0.3463687
# Identify numeric variables
num_vars <- df_problem1 %>% select_if(is.numeric) %>% select(-InfantDeaths_over30)
# Numeric summary by group
df_problem1 %>%
group_by(InfantDeaths_over30) %>%
summarise(across(where(is.numeric),
list(mean = mean, sd = sd, median = median),
.names = "{.col}_{.fn}"))
## # A tibble: 2 × 19
## InfantDeaths_over30 Measles_mean Measles_sd Measles_median BMI_mean BMI_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 84.5 16.4 91 26.3 1.53
## 2 1 65.0 17.2 64 23.2 1.71
## # ℹ 13 more variables: BMI_median <dbl>, Incidents_HIV_mean <dbl>,
## # Incidents_HIV_sd <dbl>, Incidents_HIV_median <dbl>,
## # GDP_per_capita_mean <dbl>, GDP_per_capita_sd <dbl>,
## # GDP_per_capita_median <dbl>, Thinness_five_nine_years_mean <dbl>,
## # Thinness_five_nine_years_sd <dbl>, Thinness_five_nine_years_median <dbl>,
## # Schooling_mean <dbl>, Schooling_sd <dbl>, Schooling_median <dbl>
# Boxplot
num_var_names <- names(num_vars)
for (v in num_var_names) {
print(
ggplot(df_problem1, aes(x = as.factor(InfantDeaths_over30), y = .data[[v]], fill = as.factor(InfantDeaths_over30))) +
geom_boxplot() +
labs(x = "Infant Deaths ≥ 30", y = v, title = paste("Boxplot of", v, "by Outcome")) +
theme_minimal() +
scale_fill_manual(values = c("#56B4E9", "#E69F00"))
)
}
Measles: Countries with infant deaths >=30 tend to have lower measles values (median 64 vs. 91), and the boxplot shows a downward shift in the distribution. This suggests that higher infant mortality is associated with lower measles reporting rates or differing surveillance patterns.
BMI: BMI levels are lower in the >=30 infant death group (median 23.2 vs. 26.1), and the two boxplots barely overlap. This indicates that countries with higher infant mortality generally have lower population BMI.
Incidents_HIV: HIV incidence is markedly higher in countries with infant deaths >=30 (median ~1.1 vs. ~0.02), and the distribution is heavily right-skewed in this group. This pattern suggests that HIV burden is greater in settings where infant mortality is high.
GDP_per_capita: GDP per capita is dramatically lower in the high-infant-death group (median around 1,300 vs. 8,200), and the boxplot shows little overlap between groups. This strong contrast suggests that poorer countries experience far higher infant mortality.
Thinness_five_nine_years: Child thinness is substantially higher in the >=30 group (median ~8.1 vs. ~2.6), with the boxplot showing a clear upward shift. This indicates that undernutrition among children is strongly associated with higher infant deaths.
Schooling: Average years of schooling are much lower among countries with >=30 infant deaths (median ~4.2 vs. ~9.7), and the distributions are distinctly separated. This implies that lower educational attainment is strongly correlated with higher infant mortality.
# Identify categorical variables
cat_vars <- df_problem1 %>% select(where(is.factor))
# Frequency tables by outcome
for (v in names(cat_vars)) {
print(v)
print(table(df_problem1[[v]], df_problem1$InfantDeaths_over30))
print(prop.table(table(df_problem1[[v]], df_problem1$InfantDeaths_over30), margin = 2))
}
## [1] "Region"
##
## 0 1
## Africa 8 43
## Asia 15 12
## Central America and Caribbean 18 1
## European Union 27 0
## Middle East 13 1
## North America 3 0
## Oceania 8 3
## Rest of Europe 15 0
## South America 10 2
##
## 0 1
## Africa 0.06837607 0.69354839
## Asia 0.12820513 0.19354839
## Central America and Caribbean 0.15384615 0.01612903
## European Union 0.23076923 0.00000000
## Middle East 0.11111111 0.01612903
## North America 0.02564103 0.00000000
## Oceania 0.06837607 0.04838710
## Rest of Europe 0.12820513 0.00000000
## South America 0.08547009 0.03225806
## [1] "Economy_status_Developed"
##
## 0 1
## 0 80 62
## 1 37 0
##
## 0 1
## 0 0.6837607 1.0000000
## 1 0.3162393 0.0000000
## [1] "Economy_status_Developing"
##
## 0 1
## 0 37 0
## 1 80 62
##
## 0 1
## 0 0.3162393 0.0000000
## 1 0.6837607 1.0000000
Region: Africa accounts for most observations with infant deaths >=30 (43 out of 62), while regions such as Europe, North America, and Rest of Europe contribute none. This indicates that high infant mortality is geographically concentrated, especially in Africa and parts of Asia.
Economy_status_Developed: All observations classified as developed economies fall in the <30 infant deaths group (37 vs. 0), showing that no developed country in the dataset experienced infant deaths >=30. This highlights a strong separation between developed status and high infant mortality.
Economy_status_Developing: The pattern reverses for developing economies: 62 of the 62 high–infant-death observations come from developing countries. This reinforces the relationship between lower economic development and elevated infant mortality.
# Train+validation -> x, y
x_trval <- model.matrix(InfantDeaths_over30 ~ ., data = dftrval)[, -1]
y_trval <- dftrval$InfantDeaths_over30
set.seed(123)
# Lambda grid
grid <- c(10^seq(4, -2, length = 99), 0)
# Print
grid
## [1] 1.000000e+04 8.685114e+03 7.543120e+03 6.551286e+03 5.689866e+03
## [6] 4.941713e+03 4.291934e+03 3.727594e+03 3.237458e+03 2.811769e+03
## [11] 2.442053e+03 2.120951e+03 1.842070e+03 1.599859e+03 1.389495e+03
## [16] 1.206793e+03 1.048113e+03 9.102982e+02 7.906043e+02 6.866488e+02
## [21] 5.963623e+02 5.179475e+02 4.498433e+02 3.906940e+02 3.393222e+02
## [26] 2.947052e+02 2.559548e+02 2.222996e+02 1.930698e+02 1.676833e+02
## [31] 1.456348e+02 1.264855e+02 1.098541e+02 9.540955e+01 8.286428e+01
## [36] 7.196857e+01 6.250552e+01 5.428675e+01 4.714866e+01 4.094915e+01
## [41] 3.556480e+01 3.088844e+01 2.682696e+01 2.329952e+01 2.023590e+01
## [46] 1.757511e+01 1.526418e+01 1.325711e+01 1.151395e+01 1.000000e+01
## [51] 8.685114e+00 7.543120e+00 6.551286e+00 5.689866e+00 4.941713e+00
## [56] 4.291934e+00 3.727594e+00 3.237458e+00 2.811769e+00 2.442053e+00
## [61] 2.120951e+00 1.842070e+00 1.599859e+00 1.389495e+00 1.206793e+00
## [66] 1.048113e+00 9.102982e-01 7.906043e-01 6.866488e-01 5.963623e-01
## [71] 5.179475e-01 4.498433e-01 3.906940e-01 3.393222e-01 2.947052e-01
## [76] 2.559548e-01 2.222996e-01 1.930698e-01 1.676833e-01 1.456348e-01
## [81] 1.264855e-01 1.098541e-01 9.540955e-02 8.286428e-02 7.196857e-02
## [86] 6.250552e-02 5.428675e-02 4.714866e-02 4.094915e-02 3.556480e-02
## [91] 3.088844e-02 2.682696e-02 2.329952e-02 2.023590e-02 1.757511e-02
## [96] 1.526418e-02 1.325711e-02 1.151395e-02 1.000000e-02 0.000000e+00
# Cross-validation to find optimal lambda
ridge_cv <- cv.glmnet(x = x_trval, y = y_trval, family = "binomial", lambda = grid, alpha = 0)
names(ridge_cv)
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "call" "name" "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
# Plots
plot(ridge_cv)
title("Ridge CV - binomial", line = 2.5)
# Two optimal lambdas
ridge_lambda_min <- ridge_cv$lambda.min
ridge_lambda_min
## [1] 0.01526418
-log(ridge_lambda_min)
## [1] 4.182246
ridge_lambda_1se <- ridge_cv$lambda.1se
ridge_lambda_1se
## [1] 0.2559548
-log(ridge_lambda_1se)
## [1] 1.362754
ridge_cv$cvm
## [1] 1.3337705 1.3337528 1.3337325 1.3337091 1.3336821 1.3336511 1.3336154
## [8] 1.3335742 1.3335269 1.3334723 1.3334096 1.3333373 1.3332541 1.3331583
## [15] 1.3330481 1.3329212 1.3327752 1.3326071 1.3324136 1.3321910 1.3319348
## [22] 1.3316401 1.3313009 1.3309108 1.3304621 1.3299461 1.3293528 1.3286707
## [29] 1.3278868 1.3269861 1.3259516 1.3247637 1.3234003 1.3218363 1.3200493
## [36] 1.3179987 1.3156493 1.3129619 1.3098910 1.3063857 1.3023897 1.2978407
## [43] 1.2926709 1.2868065 1.2801682 1.2726719 1.2642296 1.2547508 1.2441444
## [50] 1.2323216 1.2191981 1.2046989 1.1887614 1.1713407 1.1524135 1.1319825
## [57] 1.1100799 1.0867701 1.0621507 1.0363516 1.0095333 0.9818822 0.9536062
## [64] 0.9249274 0.8960759 0.8672824 0.8387743 0.8107708 0.7834646 0.7570228
## [71] 0.7316029 0.7073369 0.6843284 0.6626562 0.6423745 0.6235148 0.6060878
## [78] 0.5900852 0.5754822 0.5622400 0.5503081 0.5396260 0.5301266 0.5217374
## [85] 0.5143838 0.5079898 0.5024860 0.4977902 0.4938350 0.4905799 0.4879432
## [92] 0.4858592 0.4842958 0.4831613 0.4825218 0.4822892 0.4824458 0.4829522
## [99] 0.4837880 0.9905219
ridge_cv$index
## Lambda
## min 96
## 1se 76
ridge_cv$cvm[ridge_cv$index]
## [1] 0.4822892 0.6235148
coef(ridge_cv, c(ridge_lambda_min, ridge_lambda_1se))
## 17 x 2 sparse Matrix of class "dgCMatrix"
## s=0.01526418 s=0.25595479
## (Intercept) 1.355210e+01 5.911226e+00
## RegionAsia 2.878540e-01 8.542672e-02
## RegionCentral America and Caribbean -2.423031e+00 -7.544565e-01
## RegionEuropean Union -5.856813e-01 -3.168081e-01
## RegionMiddle East -1.588611e+00 -5.932371e-01
## RegionNorth America -1.426218e-01 -1.770785e-01
## RegionOceania -7.510754e-01 -2.489412e-01
## RegionRest of Europe -1.527645e+00 -5.132385e-01
## RegionSouth America -1.338828e-01 -2.134472e-01
## Measles -1.621042e-02 -1.342859e-02
## BMI -4.489189e-01 -1.951587e-01
## Incidents_HIV 4.158642e-01 1.375514e-01
## GDP_per_capita -4.072031e-05 -1.074347e-05
## Thinness_five_nine_years -1.682990e-02 4.135954e-02
## Schooling -2.676139e-01 -1.158447e-01
## Economy_status_Developed1 -6.274822e-01 -3.210898e-01
## Economy_status_Developing1 6.269228e-01 3.210882e-01
ridge_fit <- glmnet(x_trval, y_trval,
family = "binomial",
alpha = 0,
lambda = grid)
plot(ridge_fit, xvar = "lambda")
title("Ridge coefficient paths", line = 2.5)
Comment: Using cross-validated Ridge regression on the combined training/validation set, the model identified λ_min = 0.0153 as the tuning parameter that minimizes the binomial deviance, with a corresponding deviance of approximately 0.48. The more conservative λ_1se = 0.2559 produced a slightly higher deviance (~0.62) but yields a simpler and more regularized model.
The cross-validation plot showed a clear decrease in deviance as penalization weakens (moving toward smaller λ values), after which the curve flattens, indicating diminishing returns in model improvement.The coefficient-path plot demonstrated the characteristic Ridge pattern where all coefficients smoothly shrunk toward zero as λ increases, without any coefficients dropping exactly to zero.
Overall, the Ridge model suggested that variables such as GDP per capita, schooling, regional indicators, and HIV incidence contributed meaningfully to predicting whether infant deaths exceed 30, though effect sizes remained heavily regularized due to the penalty.
# Create a logistic baseline model
logit_std <- glm(InfantDeaths_over30 ~ ., data = dftrval, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logit_sum <- summary(logit_std)$coefficients
# Extract coefficients at optimal lambda
ridge_coef <- coef(ridge_cv, s = "lambda.min")
ridge_coef
## 17 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 1.355210e+01
## RegionAsia 2.878540e-01
## RegionCentral America and Caribbean -2.423031e+00
## RegionEuropean Union -5.856813e-01
## RegionMiddle East -1.588611e+00
## RegionNorth America -1.426218e-01
## RegionOceania -7.510754e-01
## RegionRest of Europe -1.527645e+00
## RegionSouth America -1.338828e-01
## Measles -1.621042e-02
## BMI -4.489189e-01
## Incidents_HIV 4.158642e-01
## GDP_per_capita -4.072031e-05
## Thinness_five_nine_years -1.682990e-02
## Schooling -2.676139e-01
## Economy_status_Developed1 -6.274822e-01
## Economy_status_Developing1 6.269228e-01
# Convert sparse matrix to vector
ridge_coef_vec <- as.numeric(ridge_coef)
names(ridge_coef_vec) <- rownames(ridge_coef)
# Combine both into a single data frame
coef_table_ridge <- data.frame(
term = rownames(logit_sum),
Estimate_logit = logit_sum[, "Estimate"],
StdError_logit = logit_sum[, "Std. Error"],
zvalue_logit = logit_sum[, "z value"],
pvalue_logit = logit_sum[, "Pr(>|z|)"],
Ridge_coef = ridge_coef_vec[rownames(logit_sum)]
)
coef_table_ridge
## term
## (Intercept) (Intercept)
## RegionAsia RegionAsia
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionEuropean Union RegionEuropean Union
## RegionMiddle East RegionMiddle East
## RegionNorth America RegionNorth America
## RegionOceania RegionOceania
## RegionRest of Europe RegionRest of Europe
## RegionSouth America RegionSouth America
## Measles Measles
## BMI BMI
## Incidents_HIV Incidents_HIV
## GDP_per_capita GDP_per_capita
## Thinness_five_nine_years Thinness_five_nine_years
## Schooling Schooling
## Economy_status_Developed1 Economy_status_Developed1
## Estimate_logit StdError_logit zvalue_logit
## (Intercept) 2.882520750 8.627514e+00 0.3341079200
## RegionAsia 1.739465362 1.703934e+00 1.0208526882
## RegionCentral America and Caribbean -20.736333530 6.104900e+03 -0.0033966706
## RegionEuropean Union -29.650613002 9.062427e+03 -0.0032718182
## RegionMiddle East -15.680368591 5.421936e+03 -0.0028920237
## RegionNorth America 11.284343170 1.986258e+04 0.0005681207
## RegionOceania -0.667779777 1.354853e+00 -0.4928798237
## RegionRest of Europe -18.314231089 7.532661e+03 -0.0024313097
## RegionSouth America 2.828459661 1.842900e+00 1.5347877436
## Measles 0.011526007 3.211069e-02 0.3589460730
## BMI 0.026456018 3.555306e-01 0.0744127722
## Incidents_HIV 3.917870992 1.578883e+00 2.4814201324
## GDP_per_capita -0.001254931 5.428571e-04 -2.3117157239
## Thinness_five_nine_years 0.001535216 1.135536e-01 0.0135197447
## Schooling -0.390462028 3.288820e-01 -1.1872403508
## Economy_status_Developed1 19.726862727 8.485774e+03 0.0023246981
## pvalue_logit Ridge_coef
## (Intercept) 0.73829812 1.355210e+01
## RegionAsia 0.30732424 2.878540e-01
## RegionCentral America and Caribbean 0.99728985 -2.423031e+00
## RegionEuropean Union 0.99738947 -5.856813e-01
## RegionMiddle East 0.99769250 -1.588611e+00
## RegionNorth America 0.99954671 -1.426218e-01
## RegionOceania 0.62209750 -7.510754e-01
## RegionRest of Europe 0.99806010 -1.527645e+00
## RegionSouth America 0.12483597 -1.338828e-01
## Measles 0.71963543 -1.621042e-02
## BMI 0.94068195 -4.489189e-01
## Incidents_HIV 0.01308600 4.158642e-01
## GDP_per_capita 0.02079335 -4.072031e-05
## Thinness_five_nine_years 0.98921313 -1.682990e-02
## Schooling 0.23513283 -2.676139e-01
## Economy_status_Developed1 0.99814516 -6.274822e-01
coef_table_ridge %>%
filter(term != "(Intercept)") %>%
arrange(desc(abs(Ridge_coef)))
## term
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionMiddle East RegionMiddle East
## RegionRest of Europe RegionRest of Europe
## RegionOceania RegionOceania
## Economy_status_Developed1 Economy_status_Developed1
## RegionEuropean Union RegionEuropean Union
## BMI BMI
## Incidents_HIV Incidents_HIV
## RegionAsia RegionAsia
## Schooling Schooling
## RegionNorth America RegionNorth America
## RegionSouth America RegionSouth America
## Thinness_five_nine_years Thinness_five_nine_years
## Measles Measles
## GDP_per_capita GDP_per_capita
## Estimate_logit StdError_logit zvalue_logit
## RegionCentral America and Caribbean -20.736333530 6.104900e+03 -0.0033966706
## RegionMiddle East -15.680368591 5.421936e+03 -0.0028920237
## RegionRest of Europe -18.314231089 7.532661e+03 -0.0024313097
## RegionOceania -0.667779777 1.354853e+00 -0.4928798237
## Economy_status_Developed1 19.726862727 8.485774e+03 0.0023246981
## RegionEuropean Union -29.650613002 9.062427e+03 -0.0032718182
## BMI 0.026456018 3.555306e-01 0.0744127722
## Incidents_HIV 3.917870992 1.578883e+00 2.4814201324
## RegionAsia 1.739465362 1.703934e+00 1.0208526882
## Schooling -0.390462028 3.288820e-01 -1.1872403508
## RegionNorth America 11.284343170 1.986258e+04 0.0005681207
## RegionSouth America 2.828459661 1.842900e+00 1.5347877436
## Thinness_five_nine_years 0.001535216 1.135536e-01 0.0135197447
## Measles 0.011526007 3.211069e-02 0.3589460730
## GDP_per_capita -0.001254931 5.428571e-04 -2.3117157239
## pvalue_logit Ridge_coef
## RegionCentral America and Caribbean 0.99728985 -2.423031e+00
## RegionMiddle East 0.99769250 -1.588611e+00
## RegionRest of Europe 0.99806010 -1.527645e+00
## RegionOceania 0.62209750 -7.510754e-01
## Economy_status_Developed1 0.99814516 -6.274822e-01
## RegionEuropean Union 0.99738947 -5.856813e-01
## BMI 0.94068195 -4.489189e-01
## Incidents_HIV 0.01308600 4.158642e-01
## RegionAsia 0.30732424 2.878540e-01
## Schooling 0.23513283 -2.676139e-01
## RegionNorth America 0.99954671 -1.426218e-01
## RegionSouth America 0.12483597 -1.338828e-01
## Thinness_five_nine_years 0.98921313 -1.682990e-02
## Measles 0.71963543 -1.621042e-02
## GDP_per_capita 0.02079335 -4.072031e-05
Comment: Using the logistic regression coefficients as a reference, I printed a coefficient table that also includes the Ridge regression coefficients evaluated at the optimal λ (lambda.min = 0.0153).
When ranking predictors by the absolute value of the Ridge coefficients, the largest coefficients were observed for regional indicators, particularly Central America and Caribbean, Middle East, Rest of Europe, and Oceania. This could mean that geographic factor plays a major role in distinguishing countries with >=30 infant deaths. Economic development status (Developed vs. Developing), BMI, and HIV incidence also showed relatively large Ridge coefficients. However, GDP per capita, Measles, and Thinness (5-9 years) had coefficients very close to 0 and appeared less important after regularization.
set.seed(123)
# Lambda grid
grid <- c(10^seq(4, -2, length = 99), 0)
# Print
grid
## [1] 1.000000e+04 8.685114e+03 7.543120e+03 6.551286e+03 5.689866e+03
## [6] 4.941713e+03 4.291934e+03 3.727594e+03 3.237458e+03 2.811769e+03
## [11] 2.442053e+03 2.120951e+03 1.842070e+03 1.599859e+03 1.389495e+03
## [16] 1.206793e+03 1.048113e+03 9.102982e+02 7.906043e+02 6.866488e+02
## [21] 5.963623e+02 5.179475e+02 4.498433e+02 3.906940e+02 3.393222e+02
## [26] 2.947052e+02 2.559548e+02 2.222996e+02 1.930698e+02 1.676833e+02
## [31] 1.456348e+02 1.264855e+02 1.098541e+02 9.540955e+01 8.286428e+01
## [36] 7.196857e+01 6.250552e+01 5.428675e+01 4.714866e+01 4.094915e+01
## [41] 3.556480e+01 3.088844e+01 2.682696e+01 2.329952e+01 2.023590e+01
## [46] 1.757511e+01 1.526418e+01 1.325711e+01 1.151395e+01 1.000000e+01
## [51] 8.685114e+00 7.543120e+00 6.551286e+00 5.689866e+00 4.941713e+00
## [56] 4.291934e+00 3.727594e+00 3.237458e+00 2.811769e+00 2.442053e+00
## [61] 2.120951e+00 1.842070e+00 1.599859e+00 1.389495e+00 1.206793e+00
## [66] 1.048113e+00 9.102982e-01 7.906043e-01 6.866488e-01 5.963623e-01
## [71] 5.179475e-01 4.498433e-01 3.906940e-01 3.393222e-01 2.947052e-01
## [76] 2.559548e-01 2.222996e-01 1.930698e-01 1.676833e-01 1.456348e-01
## [81] 1.264855e-01 1.098541e-01 9.540955e-02 8.286428e-02 7.196857e-02
## [86] 6.250552e-02 5.428675e-02 4.714866e-02 4.094915e-02 3.556480e-02
## [91] 3.088844e-02 2.682696e-02 2.329952e-02 2.023590e-02 1.757511e-02
## [96] 1.526418e-02 1.325711e-02 1.151395e-02 1.000000e-02 0.000000e+00
# Cross-validation to find optimal lambda
lasso_cv <- cv.glmnet(x = x_trval, y = y_trval, family = "binomial", lambda = grid, alpha = 1)
names(lasso_cv)
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "call" "name" "glmnet.fit" "lambda.min"
## [11] "lambda.1se" "index"
dim(coef(lasso_cv))
## [1] 17 1
# Plots
plot(lasso_cv)
title("LASSO CV - binomial", line = 2.5)
# Two optimal lambdas
lasso_lambda_min <- lasso_cv$lambda.min
lasso_lambda_min
## [1] 0.01
-log(lasso_lambda_min)
## [1] 4.60517
lasso_lambda_1se <- lasso_cv$lambda.1se
lasso_lambda_1se
## [1] 0.09540955
-log(lasso_lambda_1se)
## [1] 2.349577
lasso_cv$cvm
## [1] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [8] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [15] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [22] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [29] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [36] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [43] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [50] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [57] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [64] 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872 1.3338872
## [71] 1.3338872 1.3338872 1.3338872 1.3070293 1.1754549 1.0597293 0.9682990
## [78] 0.8948038 0.8350894 0.7866688 0.7407394 0.6946588 0.6579043 0.6285050
## [85] 0.6050606 0.5845062 0.5666029 0.5518545 0.5391986 0.5282819 0.5194396
## [92] 0.5129176 0.5078816 0.5042810 0.5011912 0.4986654 0.4968462 0.4946516
## [99] 0.4935439 0.9606149
lasso_cv$index
## Lambda
## min 99
## 1se 83
lasso_cv$cvm[lasso_cv$index]
## [1] 0.4935439 0.6579043
coef(lasso_cv, c(lasso_lambda_min, lasso_lambda_1se))
## 17 x 2 sparse Matrix of class "dgCMatrix"
## s=0.01000000 s=0.09540955
## (Intercept) 1.503035e+01 10.98572793
## RegionAsia 2.260266e-01 .
## RegionCentral America and Caribbean -2.323942e+00 .
## RegionEuropean Union . .
## RegionMiddle East -4.813233e-01 .
## RegionNorth America . .
## RegionOceania -1.710656e-01 .
## RegionRest of Europe -3.805341e-01 .
## RegionSouth America 1.646567e-01 .
## Measles -7.084773e-03 .
## BMI -5.074938e-01 -0.40636130
## Incidents_HIV 5.280790e-01 0.08492836
## GDP_per_capita -7.165106e-05 .
## Thinness_five_nine_years . .
## Schooling -3.139034e-01 -0.20142368
## Economy_status_Developed1 . .
## Economy_status_Developing1 . .
lasso_fit <- glmnet(x_trval, y_trval,
family = "binomial",
alpha = 1,
lambda = grid)
plot(lasso_fit, xvar = "lambda")
title("Lasso coefficient paths", line = 2.5)
Comment: Using LASSO with cross validation, the optimal tuning parameter was lambda_min = 0.01, with a more parsimonious choice of lambda_1se = 0.0954. At lambda_min, several predictors have nonzero coefficients, while at lambda_1se only BMI, HIV incidence, and years of schooling remain in the model, indicating that these variables are the most important predictors of having 30 or more infant deaths. The minimum cross validated binomial deviance was about 0.49, and the 1SE had a slightly higher deviance of about 0.66.
# logistic baseline model summary
logit_sum <- summary(logit_std)$coefficients
# LASSO coefficients at optimal lambda (lambda.min)
lasso_coef <- coef(lasso_cv, s = lasso_lambda_min)
lasso_coef_vec <- as.numeric(lasso_coef)
names(lasso_coef_vec) <- rownames(lasso_coef)
# logistic + LASSO coefficient table
coef_table_lasso <- data.frame(
term = rownames(logit_sum),
Estimate_logit = logit_sum[, "Estimate"],
StdError_logit = logit_sum[, "Std. Error"],
zvalue_logit = logit_sum[, "z value"],
pvalue_logit = logit_sum[, "Pr(>|z|)"],
LASSO_coef = lasso_coef_vec[rownames(logit_sum)]
)
coef_table_lasso
## term
## (Intercept) (Intercept)
## RegionAsia RegionAsia
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## RegionEuropean Union RegionEuropean Union
## RegionMiddle East RegionMiddle East
## RegionNorth America RegionNorth America
## RegionOceania RegionOceania
## RegionRest of Europe RegionRest of Europe
## RegionSouth America RegionSouth America
## Measles Measles
## BMI BMI
## Incidents_HIV Incidents_HIV
## GDP_per_capita GDP_per_capita
## Thinness_five_nine_years Thinness_five_nine_years
## Schooling Schooling
## Economy_status_Developed1 Economy_status_Developed1
## Estimate_logit StdError_logit zvalue_logit
## (Intercept) 2.882520750 8.627514e+00 0.3341079200
## RegionAsia 1.739465362 1.703934e+00 1.0208526882
## RegionCentral America and Caribbean -20.736333530 6.104900e+03 -0.0033966706
## RegionEuropean Union -29.650613002 9.062427e+03 -0.0032718182
## RegionMiddle East -15.680368591 5.421936e+03 -0.0028920237
## RegionNorth America 11.284343170 1.986258e+04 0.0005681207
## RegionOceania -0.667779777 1.354853e+00 -0.4928798237
## RegionRest of Europe -18.314231089 7.532661e+03 -0.0024313097
## RegionSouth America 2.828459661 1.842900e+00 1.5347877436
## Measles 0.011526007 3.211069e-02 0.3589460730
## BMI 0.026456018 3.555306e-01 0.0744127722
## Incidents_HIV 3.917870992 1.578883e+00 2.4814201324
## GDP_per_capita -0.001254931 5.428571e-04 -2.3117157239
## Thinness_five_nine_years 0.001535216 1.135536e-01 0.0135197447
## Schooling -0.390462028 3.288820e-01 -1.1872403508
## Economy_status_Developed1 19.726862727 8.485774e+03 0.0023246981
## pvalue_logit LASSO_coef
## (Intercept) 0.73829812 1.503035e+01
## RegionAsia 0.30732424 2.260266e-01
## RegionCentral America and Caribbean 0.99728985 -2.323942e+00
## RegionEuropean Union 0.99738947 0.000000e+00
## RegionMiddle East 0.99769250 -4.813233e-01
## RegionNorth America 0.99954671 0.000000e+00
## RegionOceania 0.62209750 -1.710656e-01
## RegionRest of Europe 0.99806010 -3.805341e-01
## RegionSouth America 0.12483597 1.646567e-01
## Measles 0.71963543 -7.084773e-03
## BMI 0.94068195 -5.074938e-01
## Incidents_HIV 0.01308600 5.280790e-01
## GDP_per_capita 0.02079335 -7.165106e-05
## Thinness_five_nine_years 0.98921313 0.000000e+00
## Schooling 0.23513283 -3.139034e-01
## Economy_status_Developed1 0.99814516 0.000000e+00
# Important coefficients
coef_table_lasso %>%
dplyr::filter(term != "(Intercept)", LASSO_coef != 0) %>%
dplyr::arrange(dplyr::desc(abs(LASSO_coef)))
## term
## RegionCentral America and Caribbean RegionCentral America and Caribbean
## Incidents_HIV Incidents_HIV
## BMI BMI
## RegionMiddle East RegionMiddle East
## RegionRest of Europe RegionRest of Europe
## Schooling Schooling
## RegionAsia RegionAsia
## RegionOceania RegionOceania
## RegionSouth America RegionSouth America
## Measles Measles
## GDP_per_capita GDP_per_capita
## Estimate_logit StdError_logit zvalue_logit
## RegionCentral America and Caribbean -20.736333530 6.104900e+03 -0.003396671
## Incidents_HIV 3.917870992 1.578883e+00 2.481420132
## BMI 0.026456018 3.555306e-01 0.074412772
## RegionMiddle East -15.680368591 5.421936e+03 -0.002892024
## RegionRest of Europe -18.314231089 7.532661e+03 -0.002431310
## Schooling -0.390462028 3.288820e-01 -1.187240351
## RegionAsia 1.739465362 1.703934e+00 1.020852688
## RegionOceania -0.667779777 1.354853e+00 -0.492879824
## RegionSouth America 2.828459661 1.842900e+00 1.534787744
## Measles 0.011526007 3.211069e-02 0.358946073
## GDP_per_capita -0.001254931 5.428571e-04 -2.311715724
## pvalue_logit LASSO_coef
## RegionCentral America and Caribbean 0.99728985 -2.323942e+00
## Incidents_HIV 0.01308600 5.280790e-01
## BMI 0.94068195 -5.074938e-01
## RegionMiddle East 0.99769250 -4.813233e-01
## RegionRest of Europe 0.99806010 -3.805341e-01
## Schooling 0.23513283 -3.139034e-01
## RegionAsia 0.30732424 2.260266e-01
## RegionOceania 0.62209750 -1.710656e-01
## RegionSouth America 0.12483597 1.646567e-01
## Measles 0.71963543 -7.084773e-03
## GDP_per_capita 0.02079335 -7.165106e-05
Comment: Using the optimal lambda (λmin = 0.01), the most important variables, based on the magnitude of the LASSO coefficients, were Region (Central America and Caribbean, Middle East, Rest of Europe, Asia, Oceania, South America), Incidents_HIV, BMI, and Schooling. All other predictors were shrunk to zero, indicating limited additional predictive value for infant deaths >=30.
### Final model: LASSO with lambda.1se
# Final lambda selection (lasso, lambda.1se)
lasso_lambda_1se <- lasso_cv$lambda.1se
# test set model matrix and outcome
x_test <- model.matrix(InfantDeaths_over30 ~ ., data = dftest)[, -1]
y_test <- dftest$InfantDeaths_over30
# testset probability
test_prob <- as.numeric(
predict(lasso_cv,
s = lasso_lambda_1se,
newx = x_test,
type = "response")
)
# test MSE
test_mse <- mean( (test_prob - y_test)^2 )
test_mse
## [1] 0.09311305
# 0.5 cut-off
test_pred_class <- ifelse(test_prob > 0.5, 1, 0)
# test error
test_error <- mean(test_pred_class != y_test)
test_error
## [1] 0.06818182
plot_df <- data.frame(
InfantDeaths_over30 = factor(y_test, levels = c(0, 1)),
prob = test_prob
)
ggplot(plot_df, aes(x = InfantDeaths_over30, y = prob)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.4) +
labs(
x = "Observed InfantDeaths_over30",
y = "Predicted probability",
title = "LASSO (lambda.1se) predicted probabilities on the test set"
) +
theme_minimal()
Comment: I selected the LASSO model with λ.1se as the final model because it provided a more parsimonious model while maintaining similar predictive performance to λ.min. The λ.1se model also reduced the risk of overfitting, which is important given the relatively small sample size in the training and validation sets. The final LASSO model achieved a test MSE of 0.093 and a misclassification error rate of 6.8%, suggesting good predictive accuracy. The predicted probabilities showed clear separation between the two outcome groups. Countries with fewer than 30 infant deaths predicted probabilities mostly below 0.3, whereas those with 30 or more deaths clustered around 0.6–0.8. This indicates that the model captures meaningful differences between groups and fits the data well.
Based on the LASSO model using the λ.1se penalty, several variables emerge as meaningful predictors of whether a country experiences 30 or more infant deaths. Because LASSO shrinks unimportant predictors exactly to zero, the remaining non-zero coefficients represent variables that contribute most strongly to distinguishing high-mortality countries from low-mortality countries.
Countries in regions such as Central America and the Caribbean, the Middle East, and Oceania show substantially different infant-mortality risks compared with the reference region, even after accounting for other variables. These region effects likely reflect broader structural factors such as health-system capacity, access to prenatal care, or public-health infrastructure.
Among continuous predictors, higher HIV incidence and lower BMI and schooling levels were associated with higher odds of having 30 or more infant deaths. These associations are consistent with known public-health patterns: HIV burden can strain health systems and contribute to poorer maternal and neonatal outcomes, while low schooling levels often reflect underlying socioeconomic disadvantage and reduced access to preventive services. Lower average BMI may also indicate nutritional challenges or food insecurity, which can indirectly affect infant health.
Importantly, LASSO focuses on predictive importance rather than statistical significance. Thus, the variables retained by the model should be interpreted as the most informative for distinguishing high-risk countries, not necessarily as causal factors. Nonetheless, the results highlight key structural and health-system indicators that tend to differentiate countries with high infant mortality from those with lower rates. For practitioners, these findings can help prioritize areas for intervention, such as strengthening HIV prevention and treatment programs, improving maternal education, and addressing regional health-system disparities.
# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df$Life_expectancy, p=0.75, list=FALSE)
dftrval <- df[trainval,]
dftest <- df[-trainval,]
# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$Life_expectancy, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]
# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 2864 14
## [1] 2150 14
## [1] 714 14
## [1] 1435 14
## [1] 715 14
# Remove Country
dftrval <- dftrval %>%
select(-Country)
dftest <- dftest %>%
select(-Country)
dftr <- dftr %>%
select(-Country)
dfval <- dfval %>%
select(-Country)
# Check structure
str(dftrval)
## 'data.frame': 2150 obs. of 13 variables:
## $ Region : Factor w/ 9 levels "Africa","Asia",..: 4 2 9 5 3 8 4 5 8 9 ...
## $ Year : int 2015 2007 2006 2012 2006 2015 2000 2001 2008 2012 ...
## $ Infant_deaths : num 2.7 51.5 32.8 3.4 9.8 6.6 8.7 22 15.3 15.4 ...
## $ Measles : int 94 35 74 89 86 97 99 87 92 70 ...
## $ BMI : num 26 21.2 25.3 27 26.4 26.2 25.9 27.9 26.5 26.1 ...
## $ Incidents_HIV : num 0.09 0.13 0.79 0.08 0.16 0.08 0.08 0.13 0.43 0.24 ...
## $ GDP_per_capita : int 25742 1076 4146 33995 9110 9313 8971 3708 2235 9057 ...
## $ Thinness_five_nine_years : num 0.5 28 5.5 1.1 1.9 2.3 2.3 3.9 3.1 2.8 ...
## $ Schooling : num 9.7 5 7.9 12.8 7.9 12 10.2 9.6 10.9 7.3 ...
## $ Economy_status_Developed : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
## $ Economy_status_Developing: Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 2 2 2 ...
## $ Life_expectancy : num 82.8 65.4 67 81.7 78.2 71.2 71.2 71.9 68.7 74.2 ...
## $ InfantDeaths_over30 : num 0 1 1 0 0 0 0 0 0 0 ...
library(dplyr)
library(splines)
library(boot)
# 1) smoothing spline + MSE
# 2) natural spline (# of knots as CV) + MSE
# 3) train data scatter plot + two lines
fit_plot_splines <- function(xvar, df_train, df_val) {
# x, y
x_train <- df_train[[xvar]]
y_train <- df_train$Life_expectancy
x_val <- df_val[[xvar]]
y_val <- df_val$Life_expectancy
## 1. smoothing spline (df: cv=TRUE)
ss_fit <- smooth.spline(x = x_train, y = y_train, cv = TRUE)
ss_df <- ss_fit$df
# MSE
pred_ss_val <- predict(ss_fit, x = x_val)$y
MSE_ss <- mean((y_val - pred_ss_val)^2)
## 2. natural spline (# of knots as CV)
cand_knots <- 3:8
cv_mse <- rep(NA, length(cand_knots))
df_train_tmp <- df_train %>%
mutate(x_tmp = .data[[xvar]])
for (i in seq_along(cand_knots)) {
k <- cand_knots[i]
# equally spaced quantiles: set interior knots
probs <- seq(0, 1, length.out = k + 2)
knots_i <- quantile(x_train, probs = probs)[-c(1, k + 2)]
fit_tmp <- glm(Life_expectancy ~ ns(x_tmp, knots = knots_i),
data = df_train_tmp)
cv_res <- cv.glm(df_train_tmp, fit_tmp, K = 10)
cv_mse[i] <- cv_res$delta[1]
}
best_k <- cand_knots[which.min(cv_mse)]
probs_best <- seq(0, 1, length.out = best_k + 2)
best_knots <- quantile(x_train, probs = probs_best)[-c(1, best_k + 2)]
ns_fit <- lm(Life_expectancy ~ ns(x_tmp, knots = best_knots),
data = df_train_tmp)
df_val_tmp <- df_val %>%
mutate(x_tmp = .data[[xvar]])
pred_ns_val <- predict(ns_fit, newdata = df_val_tmp)
MSE_ns <- mean((y_val - pred_ns_val)^2)
## 3. Plots (training data)
x_grid <- seq(min(x_train), max(x_train), length.out = 100)
ss_grid <- predict(ss_fit, x_grid)$y
grid_df <- data.frame(x_tmp = x_grid)
ns_grid <- predict(ns_fit, newdata = grid_df)
plot(x_train, y_train,
xlab = xvar,
ylab = "Life_expectancy",
main = paste("Life_expectancy vs", xvar))
lines(x_grid, ss_grid, col = "blue", lwd = 2)
lines(x_grid, ns_grid, col = "red", lwd = 2)
legend("topleft",
legend = c(
paste0("Smoothing spline (df = ", round(ss_df, 1), ")"),
paste0("Natural spline (knots = ", best_k, ")")
),
col = c("blue", "red"),
lty = 1, lwd = 2, bty = "n")
# Values for the table
out <- list(
var = xvar,
df_ss = ss_df,
MSE_ss = MSE_ss,
knots_ns = best_k,
MSE_ns = MSE_ns
)
return(out)
}
res_year <- fit_plot_splines("Year", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_year
## $var
## [1] "Year"
##
## $df_ss
## [1] 2.000005
##
## $MSE_ss
## [1] 84.61247
##
## $knots_ns
## [1] 7
##
## $MSE_ns
## [1] 85.19888
res_infant <- fit_plot_splines("Infant_deaths", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_infant
## $var
## [1] "Infant_deaths"
##
## $df_ss
## [1] 22.69609
##
## $MSE_ss
## [1] 10.5828
##
## $knots_ns
## [1] 5
##
## $MSE_ns
## [1] 10.54443
res_gdp <- fit_plot_splines("GDP_per_capita", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_gdp
## $var
## [1] "GDP_per_capita"
##
## $df_ss
## [1] 27.06004
##
## $MSE_ss
## [1] 29.62522
##
## $knots_ns
## [1] 4
##
## $MSE_ns
## [1] 29.45887
res_measles <- fit_plot_splines("Measles", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_measles
## $var
## [1] "Measles"
##
## $df_ss
## [1] 21.67301
##
## $MSE_ss
## [1] 43.54895
##
## $knots_ns
## [1] 8
##
## $MSE_ns
## [1] 49.43904
res_bmi <- fit_plot_splines("BMI", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_bmi
## $var
## [1] "BMI"
##
## $df_ss
## [1] 29.01203
##
## $MSE_ss
## [1] 42.95983
##
## $knots_ns
## [1] 8
##
## $MSE_ns
## [1] 43.90295
res_hiv <- fit_plot_splines("Incidents_HIV", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_hiv
## $var
## [1] "Incidents_HIV"
##
## $df_ss
## [1] 9.369317
##
## $MSE_ss
## [1] 40.8478
##
## $knots_ns
## [1] 8
##
## $MSE_ns
## [1] 38.76133
res_school <- fit_plot_splines("Schooling", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_school
## $var
## [1] "Schooling"
##
## $df_ss
## [1] 40.45272
##
## $MSE_ss
## [1] 37.6862
##
## $knots_ns
## [1] 8
##
## $MSE_ns
## [1] 37.33973
res_thin <- fit_plot_splines("Thinness_five_nine_years", dftr, dfval)
## Warning in smooth.spline(x = x_train, y = y_train, cv = TRUE): cross-validation
## with non-unique 'x' values seems doubtful
res_thin
## $var
## [1] "Thinness_five_nine_years"
##
## $df_ss
## [1] 34.64684
##
## $MSE_ss
## [1] 46.68214
##
## $knots_ns
## [1] 8
##
## $MSE_ns
## [1] 50.23682
# Null model: intercept only
null_fit <- lm(Life_expectancy ~ 1, data = dftr)
# MSE
null_pred_val <- predict(null_fit, newdata = dfval)
MSE_null <- mean((dfval$Life_expectancy - null_pred_val)^2)
MSE_null
## [1] 87.29366
| Variable Name | MSE from Smoothing Spline | S moothing Spline (SS) DF | MSE from ns() | # knots for ns() | Better parameterization? |
|---|---|---|---|---|---|
| Intercept only | MSE intercept only | NULL MSE=87.3 | |||
| Year | 84.6 | df= 2.0 | 85.3 | ns() # knot= 5 | SS |
| Infant_ deaths | 10.6 | df= 22.7 | 10.5 | ns() # knot= 8 | NS |
| GDP_per_ capita | 29.6 | df= 27.1 | 29.5 | ns() # knot= 4 | NS |
| Measles | 43.5 | df= 21.7 | 49.4 | ns() # knot= 8 | SS |
| BMI | 43.0 | df= 29.0 | 43.9 | ns() # knot= 8 | SS |
| Incidents_ HIV | 40.8 | df= 9.4 | 38.8 | ns() # knot= 8 | NS |
| Scholling | 37.7 | df= 40.5 | 37.3 | ns() # knot= 8 | NS |
| Thinness_ five_nine_ years | 46.7 | df= 34.6 | 50.2 | ns() # knot= 8 | SS |
| Variable Name | MSE |
|---|---|
| Region | 32.76 |
| Econ_Developed | 64.43 |
| Econ_Developing | 64.43 |
cat_vars <- names(dftr)[sapply(dftr, function(x) is.factor(x) | is.character(x))]
cat_vars <- setdiff(cat_vars, "Life_expectancy")
cat_vars
## [1] "Region" "Economy_status_Developed"
## [3] "Economy_status_Developing"
# Boxplots
for (v in cat_vars) {
boxplot(dftr$Life_expectancy ~ dftr[[v]],
main = paste("Life_expectancy by", v),
xlab = v,
ylab = "Life_expectancy")
}
# regression analysis + validation MSE
mse_cat <- data.frame(
Variable = cat_vars,
MSE = NA_real_,
stringsAsFactors = FALSE
)
for (i in seq_along(cat_vars)) {
v <- cat_vars[i]
# Life_expectancy ~ categorical variable
fml <- as.formula(paste("Life_expectancy ~", v))
fit <- lm(fml, data = dftr)
pred_val <- predict(fit, newdata = dfval)
mse_cat$MSE[i] <- mean((dfval$Life_expectancy - pred_val)^2)
}
mse_cat
## Variable MSE
## 1 Region 32.75974
## 2 Economy_status_Developed 64.43176
## 3 Economy_status_Developing 64.43176
Based on the univariate analyses, Infant_deaths showed the strongest relationship with Life_expectancy. This variable produced by far the lowest prediction errors, with MSE values of 10.54-10.58, indicating a very strong inverse association between infant death and life expectancy.
GDP_per_capita also showed relatively strong univariate relationship. It had an MSE of 29.46–29.63, making it the second strongest continuous predictor.
Region was the third strongest predictor with a validation MSE of 32.76.
Schooling (MSE 37.34–37.69) and Incidents_HIV (MSE 38.76–40.85) showed moderate predictive strength as well.
Overall, Infant_deaths is the strongest univariate predictor of Life_expectancy, followed by GDP_per_capita. Region, Schooling, Incidents_HIV, and Region also show meaningful univariate associations.
library(gam)
library(splines)
# Full model using best spline choices for each predictor
full_model <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(Measles, df = 8) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
summary(full_model)
##
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths,
## df = 10) + ns(GDP_per_capita, df = 6) + s(Measles, df = 8) +
## s(BMI, df = 8) + ns(Incidents_HIV, df = 10) + ns(Schooling,
## df = 10) + s(Thinness_five_nine_years, df = 8) + Region,
## data = dftr)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.88717 -1.02089 0.05979 1.14652 6.48500
##
## (Dispersion Parameter for gaussian family taken to be 3.4898)
##
## Null Deviance: 127300.2 on 1434 degrees of freedom
## Residual Deviance: 4760.14 on 1364 degrees of freedom
## AIC: 5937.08
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Year, df = 2) 1 3900 3899.7 1117.4443 < 2.2e-16 ***
## ns(Infant_deaths, df = 10) 10 97528 9752.8 2794.6174 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6) 6 1008 168.0 48.1264 < 2.2e-16 ***
## s(Measles, df = 8) 1 1 1.3 0.3857 0.53470
## s(BMI, df = 8) 1 59 58.9 16.8835 4.211e-05 ***
## ns(Incidents_HIV, df = 10) 10 7448 744.8 213.4231 < 2.2e-16 ***
## ns(Schooling, df = 10) 10 529 52.9 15.1445 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8) 1 10 9.9 2.8422 0.09205 .
## Region 8 636 79.5 22.7686 < 2.2e-16 ***
## Residuals 1364 4760 3.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Year, df = 2) 1 19.057 1.364e-05 ***
## ns(Infant_deaths, df = 10)
## ns(GDP_per_capita, df = 6)
## s(Measles, df = 8) 7 7.073 2.571e-08 ***
## s(BMI, df = 8) 7 10.603 4.816e-13 ***
## ns(Incidents_HIV, df = 10)
## ns(Schooling, df = 10)
## s(Thinness_five_nine_years, df = 8) 7 12.467 1.554e-15 ***
## Region
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# predictions on validation set
pred_val <- predict(full_model, newdata = dfval)
# validation MSE
MSE_full <- mean((dfval$Life_expectancy - pred_val)^2)
MSE_full
## [1] 3.847604
Comment: I fitted a GAM on the training set using the best-performing spline parameterizations from the univariate analyses (NS for Infant_deaths, GDP_per_capita, Incidents_HIV, and Schooling; SS for Year, Measles, BMI, and Thinness; and Region as a factor). The model significantly reduced deviance (from 127,300 to 4,760) and showed strong nonparametric effects for several predictors. The validation MSE was 3.85, indicating that the multivariable spline model provides substantial improvement in predictive accuracy over the univariate models.
# Full model validation MSE -> base MSE
best_MSE <- MSE_full
best_MSE
## [1] 3.847604
# Model without Year
mod_noYear <- gam(
Life_expectancy ~
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(Measles, df = 8) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noYear <- predict(mod_noYear, newdata = dfval)
MSE_noYear <- mean((dfval$Life_expectancy - pred_noYear)^2)
MSE_noYear
## [1] 3.933785
Comment: MSE changed from 3.847604 to 3.933785. Removing Year worsened prediction error, so Year was kept in the model.
Best MSE remains 3.847604.
# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without Infant_deaths
mod_noInfant <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(GDP_per_capita, df = 6) +
s(Measles, df = 8) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noInfant <- predict(mod_noInfant, newdata = dfval)
MSE_noInfant <- mean((dfval$Life_expectancy - pred_noInfant)^2)
MSE_noInfant
## [1] 9.168095
Comment: MSE changed from 3.847604 to 9.168095. Removing Infant_deaths worsened prediction error, so Infant_deaths was kept in the model.
Best MSE remains 3.847604.
# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without GDP_per_capita
mod_noGDP <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
s(Measles, df = 8) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noGDP <- predict(mod_noGDP, newdata = dfval)
MSE_noGDP <- mean((dfval$Life_expectancy - pred_noGDP)^2)
MSE_noGDP
## [1] 4.036478
Comment: MSE changed from 3.847604 to 4.036478. Removing GDP_per_capita worsened prediction error, so GDP_per_capita was kept in the model.
Best MSE remains 3.847604.
# Best MSE carried over from previous step
best_MSE
## [1] 3.847604
# Model without Measles
mod_noMeasles <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noMeasles <- predict(mod_noMeasles, newdata = dfval)
MSE_noMeasles <- mean((dfval$Life_expectancy - pred_noMeasles)^2)
MSE_noMeasles
## [1] 3.816505
Comment: MSE changed from 3.847604 to 3.816505. Removing Measles improved prediction error, so Measles was removed from the model.
Best MSE is now 3.816505.
# Best MSE updated
best_MSE <- MSE_noMeasles
best_MSE
## [1] 3.816505
# Model without BMI
mod_noBMI <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noBMI <- predict(mod_noBMI, newdata = dfval)
MSE_noBMI <- mean((dfval$Life_expectancy - pred_noBMI)^2)
MSE_noBMI
## [1] 3.934864
Comment: MSE changed from 3.816505 to 3.934864. Removing BMI worsened prediction error, so BMI was kept in the model.
Best MSE remains 3.816505.
# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles # 3.816505
best_MSE
## [1] 3.816505
# Model without Incidents_HIV
mod_noHIV <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noHIV <- predict(mod_noHIV, newdata = dfval)
MSE_noHIV <- mean((dfval$Life_expectancy - pred_noHIV)^2)
MSE_noHIV
## [1] 7.165852
Comment: MSE changed from 3.816505 to 7.165852. Removing Incidents_HIV worsened prediction error, so Incidents_HIV was kept in the model.
Best MSE remains 3.816505.
# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles # 3.816505
best_MSE
## [1] 3.816505
# Model without Schooling
mod_noSchool <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
# validation MSE
pred_noSchool <- predict(mod_noSchool, newdata = dfval)
MSE_noSchool <- mean((dfval$Life_expectancy - pred_noSchool)^2)
MSE_noSchool
## [1] 4.049667
Comment: MSE changed from 3.816505 to 4.049667. Removing Schooling worsened prediction error, so Schooling was kept in the model.
Best MSE remains 3.847604.
# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles # 3.816505
best_MSE
## [1] 3.816505
# Model without Thinness_five_nine_years
mod_noThin <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
Region,
data = dftr
)
# validation MSE
pred_noThin <- predict(mod_noThin, newdata = dfval)
MSE_noThin <- mean((dfval$Life_expectancy - pred_noThin)^2)
MSE_noThin
## [1] 3.969929
Comment: MSE changed from 3.816505 to 3.969929. Removing Thinness_five_nine_years worsened prediction error, so Thinness_five_nine_years was kept in the model.
Best MSE remains 3.816505.
# Best MSE carried over from previous step
best_MSE <- MSE_noMeasles # 3.816505
best_MSE
## [1] 3.816505
# Model without Region
mod_noRegion <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8),
data = dftr
)
# validation MSE
pred_noRegion <- predict(mod_noRegion, newdata = dfval)
MSE_noRegion <- mean((dfval$Life_expectancy - pred_noRegion)^2)
MSE_noRegion
## [1] 4.345767
Comment: MSE changed from 3.816505 to 4.345767. Removing Region worsened prediction error, so Region was kept in the model.
Best MSE remains 3.816505.
# Summarize MSE for each model
MSE_full
## [1] 3.847604
MSE_noYear
## [1] 3.933785
MSE_noInfant
## [1] 9.168095
MSE_noGDP
## [1] 4.036478
MSE_noMeasles
## [1] 3.816505
MSE_noBMI
## [1] 3.934864
MSE_noHIV
## [1] 7.165852
MSE_noSchool
## [1] 4.049667
MSE_noThin
## [1] 3.969929
MSE_noRegion
## [1] 4.345767
| Best MSE at Start | Model (Remove Variable) | Validation Set MSE | Decision | Best MSE at End |
|---|---|---|---|---|
| 3.847604 | Full Multivariable Model | 3.847604 | Base MSE | 3.847604 |
| 3.847604 | Remove Year | 3.933785 | MSE ↑ → KEEP | 3.847604 |
| 3.847604 | Remove Infant_deaths | 9.168095 | MSE ↑ → KEEP | 3.847604 |
| 3.847604 | Remove GDP_per_capita | 4.036478 | MSE ↑ → KEEP | 3.847604 |
| 3.847604 | Remove Measles | 3.816505 | MSE ↓ → REMOVE | 3.816505 |
| 3.816505 | Remove BMI | 3.934864 | MSE ↑ → KEEP | 3.816505 |
| 3.816505 | Remove Incidents_HIV | 7.165852 | MSE ↑ → KEEP | 3.816505 |
| 3.816505 | Remove Schooling | 4.049667 | MSE ↑ → KEEP | 3.816505 |
| 3.816505 | Remove Thinness_5_9_years | 3.969929 | MSE ↑ → KEEP | 3.816505 |
| 3.816505 | Remove Region | 4.345767 | MSE ↑ → KEEP | 3.816505 |
Based on the univariate spline analyses, Infant_deaths showed the strongest relationship with Life_expectancy, producing the lowest validation MSE (approximately 10.54-10.58). This indicates a very strong inverse association, where higher infant deaths predict substantially lower life expectancy.
GDP_per_capita also demonstrated a strong univariate association, with validation MSE around 29.46–29.63, making it the second most predictive continuous variable.
Among the categorical variables, Region was the most informative predictor, with a validation MSE of 32.76, outperforming both developed- and developing-country indicators.
Other continuous predictors such as Schooling, Incidents_HIV, and BMI had moderate univariate relationships (MSE values between 37-43).
Overall, the variables most strongly related to Life_expectancy on a univariate basis are Infant_deaths, GDP_per_capita, and Region.
# Final model includes all variables (Year, Infant_deaths, GDP_per_capita, BMI, Incidents_HIV, Schooling, Thinness_five_nine_years, and Region) except Measles.
final_model <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftr
)
summary(final_model)
##
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths,
## df = 10) + ns(GDP_per_capita, df = 6) + s(BMI, df = 8) +
## ns(Incidents_HIV, df = 10) + ns(Schooling, df = 10) + s(Thinness_five_nine_years,
## df = 8) + Region, data = dftr)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.27633 -0.99283 0.01517 1.19476 6.50034
##
## (Dispersion Parameter for gaussian family taken to be 3.5809)
##
## Null Deviance: 127300.2 on 1434 degrees of freedom
## Residual Deviance: 4913.04 on 1372 degrees of freedom
## AIC: 5966.448
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Year, df = 2) 1 3796 3795.5 1059.9233 < 2.2e-16 ***
## ns(Infant_deaths, df = 10) 10 99099 9909.9 2767.4130 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6) 6 1092 182.1 50.8471 < 2.2e-16 ***
## s(BMI, df = 8) 1 60 59.7 16.6838 4.671e-05 ***
## ns(Incidents_HIV, df = 10) 10 7303 730.3 203.9414 < 2.2e-16 ***
## ns(Schooling, df = 10) 10 563 56.3 15.7166 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8) 1 2 1.7 0.4764 0.4902
## Region 8 710 88.7 24.7756 < 2.2e-16 ***
## Residuals 1372 4913 3.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Year, df = 2) 1 20.895 5.287e-06 ***
## ns(Infant_deaths, df = 10)
## ns(GDP_per_capita, df = 6)
## s(BMI, df = 8) 7 10.185 1.752e-12 ***
## ns(Incidents_HIV, df = 10)
## ns(Schooling, df = 10)
## s(Thinness_five_nine_years, df = 8) 7 15.146 < 2.2e-16 ***
## Region
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Final model fitted on training + validation (dftrval)
final_model <- gam(
Life_expectancy ~
s(Year, df = 2) +
ns(Infant_deaths, df = 10) +
ns(GDP_per_capita, df = 6) +
s(BMI, df = 8) +
ns(Incidents_HIV, df = 10) +
ns(Schooling, df = 10) +
s(Thinness_five_nine_years, df = 8) +
Region,
data = dftrval
)
summary(final_model)
##
## Call: gam(formula = Life_expectancy ~ s(Year, df = 2) + ns(Infant_deaths,
## df = 10) + ns(GDP_per_capita, df = 6) + s(BMI, df = 8) +
## ns(Incidents_HIV, df = 10) + ns(Schooling, df = 10) + s(Thinness_five_nine_years,
## df = 8) + Region, data = dftrval)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.335666 -1.020961 0.002681 1.145005 7.848873
##
## (Dispersion Parameter for gaussian family taken to be 3.6013)
##
## Null Deviance: 189715 on 2149 degrees of freedom
## Residual Deviance: 7515.88 on 2087 degrees of freedom
## AIC: 8920.268
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Year, df = 2) 1 5884 5883.5 1633.7290 < 2.2e-16 ***
## ns(Infant_deaths, df = 10) 10 148042 14804.2 4110.7996 < 2.2e-16 ***
## ns(GDP_per_capita, df = 6) 6 1442 240.3 66.7147 < 2.2e-16 ***
## s(BMI, df = 8) 1 34 34.4 9.5403 0.002037 **
## ns(Incidents_HIV, df = 10) 10 10504 1050.4 291.6793 < 2.2e-16 ***
## ns(Schooling, df = 10) 10 823 82.3 22.8659 < 2.2e-16 ***
## s(Thinness_five_nine_years, df = 8) 1 3 2.9 0.8126 0.367469
## Region 8 1142 142.8 39.6547 < 2.2e-16 ***
## Residuals 2087 7516 3.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Year, df = 2) 1 17.490 3.007e-05 ***
## ns(Infant_deaths, df = 10)
## ns(GDP_per_capita, df = 6)
## s(BMI, df = 8) 7 14.079 < 2.2e-16 ***
## ns(Incidents_HIV, df = 10)
## ns(Schooling, df = 10)
## s(Thinness_five_nine_years, df = 8) 7 20.543 < 2.2e-16 ***
## Region
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Testing MSE
# Predict using the final model on the testing set
pred_test <- predict(final_model, newdata = dftest)
# Compute test MSE
MSE_test <- mean((dftest$Life_expectancy - pred_test)^2)
MSE_test
## [1] 3.740605
# Predicted vs Observed Plot
plot(
dftest$Life_expectancy, pred_test,
xlab = "Observed Life Expectancy",
ylab = "Predicted Life Expectancy",
main = "Predicted vs Observed Life Expectancy (Testing Set)",
pch = 19, col = "darkgray"
)
# Add x=y reference line
abline(a = 0, b = 1, col = "red", lwd = 2)
Comment: The testing MSE for the final model was 3.740605, indicating strong predictive performance on unseen data.
The predicted vs. observed plot shows that the predictions from the final model align closely with the 45-degree reference line. Most points fall tightly around the line across the full range of life expectancy values, suggesting that the model captures the underlying relationship well. While some variability appears at the lower and upper extremes, the overall fit indicates that the model predicts life expectancy accurately for the test data.
Our goal was to build a model that accurately predicts life expectancy based on a country’s demographic, health, and economic characteristics. Using data-driven variable selection and spline-based modeling to capture nonlinear relationships, the final model performed well, achieving a low testing MSE of approximately 3.74.
Across all predictors considered, infant mortality and GDP per capita were among the strongest contributors to explaining life expectancy, showing clear nonlinear associations. Schooling, HIV incidence, BMI, and the thinness indicator also provided meaningful predictive information, while geographic region added important categorical structure. Measles incidence was the only variable that did not improve model performance and was removed during backward selection.
Overall, the model generalizes well and makes accurate predictions across the full range of life expectancy values, as shown by the close alignment between predicted and observed outcomes in the testing set. This suggests that the included variables collectively capture key population characteristics associated with national life expectancy. For an investigator, this model can serve as a reliable tool for predicting life expectancy and for understanding which factors most strongly influence these predictions.